Recently, the success of pre-training in text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities are showing a rising trend of homogeneity in their model structures, which brings the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is the modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all of common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new one. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
translated by 谷歌翻译
Solving math word problems is the task that analyses the relation of quantities and requires an accurate understanding of contextual natural language information. Recent studies show that current models rely on shallow heuristics to predict solutions and could be easily misled by small textual perturbations. To address this problem, we propose a Textual Enhanced Contrastive Learning framework, which enforces the models to distinguish semantically similar examples while holding different mathematical logic. We adopt a self-supervised manner strategy to enrich examples with subtle textual variance by textual reordering or problem re-construction. We then retrieve the hardest to differentiate samples from both equation and textual perspectives and guide the model to learn their representations. Experimental results show that our method achieves state-of-the-art on both widely used benchmark datasets and also exquisitely designed challenge datasets in English and Chinese. \footnote{Our code and data is available at \url{https://github.com/yiyunya/Textual_CL_MWP}
translated by 谷歌翻译
We conduct a systematic study of backdoor vulnerabilities in normally trained Deep Learning models. They are as dangerous as backdoors injected by data poisoning because both can be equally exploited. We leverage 20 different types of injected backdoor attacks in the literature as the guidance and study their correspondences in normally trained models, which we call natural backdoor vulnerabilities. We find that natural backdoors are widely existing, with most injected backdoor attacks having natural correspondences. We categorize these natural backdoors and propose a general detection framework. It finds 315 natural backdoors in the 56 normally trained models downloaded from the Internet, covering all the different categories, while existing scanners designed for injected backdoors can at most detect 65 backdoors. We also study the root causes and defense of natural backdoors.
translated by 谷歌翻译
High-quality traffic flow generation is the core module in building simulators for autonomous driving. However, the majority of available simulators are incapable of replicating traffic patterns that accurately reflect the various features of real-world data while also simulating human-like reactive responses to the tested autopilot driving strategies. Taking one step forward to addressing such a problem, we propose Realistic Interactive TrAffic flow (RITA) as an integrated component of existing driving simulators to provide high-quality traffic flow for the evaluation and optimization of the tested driving strategies. RITA is developed with fidelity, diversity, and controllability in consideration, and consists of two core modules called RITABackend and RITAKit. RITABackend is built to support vehicle-wise control and provide traffic generation models from real-world datasets, while RITAKit is developed with easy-to-use interfaces for controllable traffic generation via RITABackend. We demonstrate RITA's capacity to create diversified and high-fidelity traffic simulations in several highly interactive highway scenarios. The experimental findings demonstrate that our produced RITA traffic flows meet all three design goals, hence enhancing the completeness of driving strategy evaluation. Moreover, we showcase the possibility for further improvement of baseline strategies through online fine-tuning with RITA traffic flows.
translated by 谷歌翻译
我们根据计算一个扎根于每个顶点的某个加权树的家族而构成的相似性得分提出了一种有效的图形匹配算法。对于两个erd \ h {o} s-r \'enyi图$ \ mathcal {g}(n,q)$,其边缘通过潜在顶点通信相关联,我们表明该算法正确地匹配了所有范围的范围,除了所有的vertices分数外,有了很高的概率,前提是$ nq \ to \ infty $,而边缘相关系数$ \ rho $满足$ \ rho^2> \ alpha \ ailpha \大约0.338 $,其中$ \ alpha $是Otter的树木计数常数。此外,在理论上是必需的额外条件下,可以精确地匹配。这是第一个以显式常数相关性成功的多项式图匹配算法,并适用于稀疏和密集图。相比之下,以前的方法要么需要$ \ rho = 1-o(1)$,要么仅限于稀疏图。该算法的症结是一个经过精心策划的植根树的家族,称为吊灯,它可以有效地从同一树的计数中提取图形相关性,同时抑制不同树木之间的不良相关性。
translated by 谷歌翻译
为了解决数学单词问题,人类学生利用达到不同方程解决方案的各种推理逻辑。但是,自动求解器的主流序列到序列方法旨在解码通过人类注释监督的固定溶液方程。在本文中,我们通过利用一组控制代码来指导模型考虑某些推理逻辑并解码从人类参考转换的相应方程式表达式来指导模型来考虑某些推理逻辑并解码相应的方程式表达式来提出一个受控方程生成求解器。经验结果表明,我们的方法普遍提高了单人(MATH23K)和多项(draw1k,hmwp)基准的性能,在具有挑战性的多重未知数据集上,高达13.2%的准确性。
translated by 谷歌翻译
本文回顾了AIM 2022上压缩图像和视频超级分辨率的挑战。这项挑战包括两条曲目。轨道1的目标是压缩图像的超分辨率,轨迹〜2靶向压缩视频的超分辨率。在轨道1中,我们使用流行的数据集DIV2K作为培训,验证和测试集。在轨道2中,我们提出了LDV 3.0数据集,其中包含365个视频,包括LDV 2.0数据集(335个视频)和30个其他视频。在这一挑战中,有12支球队和2支球队分别提交了赛道1和赛道2的最终结果。所提出的方法和解决方案衡量了压缩图像和视频上超分辨率的最先进。提出的LDV 3.0数据集可在https://github.com/renyang-home/ldv_dataset上找到。此挑战的首页是在https://github.com/renyang-home/aim22_compresssr。
translated by 谷歌翻译
我们研究了将人类设计师创建的基于图像的,逐步组装手册转换为机器可解剖说明的问题。我们将此问题提出为顺序预测任务:在每个步骤中,我们的模型都读取手册,将要添加到当前形状中的组件定位,并注入其3D姿势。此任务构成了在手动图像和实际3D对象之间建立2D-3D对应关系的挑战,以及对看不见的3D对象的3D姿势估计,因为要在步骤中添加的新组件可以是从前一个步骤中构建的对象。为了应对这两个挑战,我们提出了一个基于学习的新型框架,即手动到执行计划网络(MEPNET),该网络(MEPNET)从一系列手动图像中重建了组装步骤。关键思想是将神经2D关键点检测模块和2D-3D投影算法进行高精度预测和强有力的概括为看不见的组件。 MEPNET在三个新收集的乐高手册数据集和Minecraft House数据集上优于现有方法。
translated by 谷歌翻译
我们解决了一对图像之间找到密集的视觉对应关系的重要任务。由于各种因素,例如质地差,重复的模式,照明变化和运动模糊,这是一个具有挑战性的问题。与使用密集信号基础真相作为本地功能匹配培训的直接监督的方法相反,我们训练3DG-STFM:一种多模式匹配模型(教师),以在3D密集的对应性监督下执行深度一致性,并将知识转移到2D单峰匹配模型(学生)。教师和学生模型均由两个基于变压器的匹配模块组成,这些模块以粗略的方式获得密集的对应关系。教师模型指导学生模型学习RGB诱导的深度信息,以实现粗糙和精细分支的匹配目的。我们还在模型压缩任务上评估了3DG-STFM。据我们所知,3DG-STFM是第一种用于本地功能匹配任务的学生教师学习方法。该实验表明,我们的方法优于室内和室外摄像头姿势估计以及同型估计问题的最先进方法。代码可在以下网址获得:https://github.com/ryan-prime/3dg-stfm。
translated by 谷歌翻译
近年来,Experts(MOE)的混合物已成为一种有前途的深度学习技术,可以将模型能力扩展为万亿多个参数,同时通过稀疏计算降低计算成本。虽然MoE开设了一个非常大的模型的新领域,但由于MOE的动态性质与系统的静态平行性/管道层之间的不匹配,因此其数以千计的GPU的实现受到限制。我们提出了Tutel,这是一种具有动态自适应并行性和管道的高度可扩展的堆栈设计和实现。 TUTEL在运行时提供自适应并行性切换和自适应管道,分别达到1.74倍和2.00倍的单MOE层加速度。我们还提出了一种用于MOE通信速度的新颖的二维层次结构算法,该算法的表现超过了2,048 GPU的先前最先前的最新时间。 Tutel汇总了所有技术,最终在16 GPU和2,048 GPU上分别提供了4.96倍和5.75倍的加速度,分别通过Fairseq:Meta的Facebook AI AI研究序列到序列工具Kit(Tutel(Tutel)(Tutel)(Tutel)(现在由Fairseq部分采用)。 Tutel源代码可在公共场所获得:https://github.com/microsoft/tutel。我们的评估表明,Tutel有效,有效地运行了一个基于现实的MOE模型,名为Swinv2-Moe,建立在Swin Transformer V2上,这是一种最先进的计算机视觉体系结构。在效率方面,Tutel加速了Swinv2-MoE,在FairSeq的训练和推理中分别达到1.55倍和2.11倍的速度。关于有效性,SWINV2-MOE模型在预训练和下游计算机视觉任务(例如可可对象检测)方面都比对应的密度密度模型都达到了卓越的精度,这表明Tutel准备对端到端现实世界模型训练的准备就绪和推理。 Swinv2-Moe在https://github.com/microsoft/swin-transformer中开放。
translated by 谷歌翻译